AITopics

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Sherlock: Self-Correcting Reasoning in Vision-Language Models

Ding, Yi, Zhang, Ruqi

arXiv.org Artificial Intelligence, Oct-24-2025

Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $β$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

2505.22651

Genre: Research Report > New Finding (0.66)

Industry: Education > Curriculum > Subject-Specific Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Unifying Vision-Language Latents for Zero-label Image Caption Enhancement

Byun, Sanghyun, Guack, Jung Ick, Odema, Mohanad, Lee, Baisub, Song, Jacob, Chung, Woo Seong

arXiv.org Artificial Intelligence, Oct-16-2025

Vision-language models (VLMs) achieve remarkable performance through large-scale image-text pretraining. However, their reliance on labeled image datasets limits scalability and leaves vast amounts of unlabeled image data underutilized. To address this, we propose Unified Vision-Language Alignment for Zero-Label Enhancement (ViZer), an enhancement training framework that enables zero-label learning in image captioning, providing a practical starting point for broader zero-label adaptation in vision-language tasks. Unlike prior approaches that rely on human or synthetically annotated datasets, ViZer actively aligns vision and language representation features during training, enabling existing VLMs to generate improved captions without requiring text labels or full retraining. We demonstrate ViZer's advantage in qualitative evaluation, as automated caption metrics such as CIDEr and BERTScore often penalize details that are absent in reference captions. Applying ViZer on SmolVLM-Base and Qwen2-VL, we observe consistent qualitative improvements, producing captions that are more grounded and descriptive than their baseline.

machine learning, natural language, vizer gt, (16 more...)

2510.12931

Country:

North America > United States (0.46)
Asia (0.46)

Genre: Research Report (0.64)

Industry:

Transportation > Ground > Road (0.46)
Leisure & Entertainment > Sports > Tennis (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Mitigating Hallucination in Multimodal Reasoning via Functional Attention Control

Lu, Haolang, Chu, Bolun, Fu, WeiYe, Nan, Guoshun, Liu, Junning, Pan, Minghui, Li, Qiankun, Yu, Yi, Wang, Hua, Wang, Kun

arXiv.org Artificial Intelligence, Oct-14-2025

Multimodal large reasoning models (MLRMs) are rapidly advancing vision-language reasoning and are emerging as a foundation for cross-modal intelligence. Hallucination remains a persistent failure mode, manifesting itself as erroneous reasoning chains and misinterpretation of visual content. In this study, we observe that attention heads exhibit a staged division: shallow heads predominantly serve perception, while deeper heads shift toward symbolic reasoning, revealing two major causes of hallucination, namely perceptual bias and reasoning drift. To address these issues, we propose a lightweight and interpretable two-step plugin, Functional Head Identification and Class-conditioned Rescaling, which locates perception- and reasoning-oriented heads and regulates their contributions without retraining. Evaluations on three real-world MLRMs (Kimi-VL, Ocean-R1, R1-Onevision), six benchmarks across three domains, and four baselines show that our plugin achieves an average improvement of 5% and up to 15%, with only <1% additional computation and 9% of baseline latency. Our approach is completely model-agnostic and significantly enhances both the reliability and interpretability of the off-the-shelf MLRMs, thereby enabling their safe deployment in high-stakes applications. Our code is available at https://anonymous.4open.science/r/Functional-Attention-Control.

large language model, machine learning, natural language, (17 more...)

2510.10285

Country:

North America > United States (0.69)
Asia (0.68)
Europe (0.46)

Genre: Research Report > New Finding (0.66)

Industry:

Government > Voting & Elections (1.00)
Government > Regional Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Textual interpretation of transient image classifications from large language models

Stoppa, Fiorenzo, Bulmus, Turan, Bloemen, Steven, Smartt, Stephen J., Groot, Paul J., Vreeswijk, Paul, Smith, Ken W.

arXiv.org Artificial Intelligence, Oct-9-2025

Modern astronomical surveys deliver immense volumes of transient detections, yet distinguishing real astrophysical signals (for example, explosive events) from bogus imaging artefacts remains a challenge. Convolutional neural networks are effectively used for real versus bogus classification; however, their reliance on opaque latent representations hinders interpretability. Here we show that large language models (LLMs) can approach the performance level of a convolutional neural network on three optical transient survey datasets (Pan-STARRS, MeerLICHT and ATLAS) while simultaneously producing direct, human-readable descriptions for every candidate. Using only 15 examples and concise instructions, Google's LLM, Gemini, achieves a 93% average accuracy across datasets that span a range of resolution and pixel scales. We also show that a second LLM can assess the coherence of the output of the first model, enabling iterative refinement by identifying problematic cases. This framework allows users to define the desired classification behaviour through natural language and examples, bypassing traditional training pipelines. Furthermore, by generating textual descriptions of observed features, LLMs enable users to query classifications as if navigating an annotated catalogue, rather than deciphering abstract latent spaces. As next-generation telescopes and surveys further increase the amount of data available, LLM-based classification could help bridge the gap between automated detection and transparent, human-level understanding.

classification, large language model, machine learning, (19 more...)

doi: 10.1038/s41550-025-02670-z

2510.06931

Country: Europe > Netherlands (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts

Yuan, Fan, Yan, Yuchen, Jiang, Yifan, Zhao, Haoran, Feng, Tao, Chen, Jinyan, Lou, Yanwei, Zhang, Wenqi, Shen, Yongliang, Lu, Weiming, Xiao, Jun, Zhuang, Yueting

arXiv.org Artificial Intelligence, Sep-30-2025

Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example. It highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Recently, numerous visual mathematical reasoning benchmarks have been proposed, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.

information, large language model, machine learning, (21 more...)

2509.2516

Country: Europe > Austria (0.27)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Consumer Health (0.93)
Leisure & Entertainment (0.68)
Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)

Why Settle for One? Text-to-ImageSet Generation and Evaluation

Jia, Chengyou, Shen, Xin, Dang, Zhuohang, Xia, Changliang, Wu, Weijia, Zhang, Xinyu, Qian, Hangwei, Tsang, Ivor W., Luo, Minnan

arXiv.org Artificial Intelligence, Sep-26-2025

Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistency methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce T2IS-Bench with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose T2IS-Eval, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose AutoT2IS, a training-free framework that maximally leverages pretrained Diffusion Transformers' in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project at https://chengyou-jia.github.io/T2IS-Home.

large language model, machine learning, natural language, (18 more...)

2506.23275

Country:

North America > United States (0.28)
Europe (0.28)

Genre:

Workflow (1.00)
Research Report > New Finding (0.67)
Research Report > Experimental Study (0.46)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

6dd16c884345ad63e4708367222410e5-Supplemental-Conference.pdf

Neural Information Processing Systems, Aug-15-2025, 16:32:18 GMT

We conducted a comparison between the adjoint method, the Green's function method and a classical Gaussian process on the ordinary differential equation model presented in section 4.1.

artificial intelligence, gaussian process, machine learning, (15 more...)

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Welp, Nvidia's RTX 5090 can crack an 8-digit password in 3 hours

PCWorld, May-12-2025, 14:48:10 GMT

I have bad news for everyone with weak passwords. A hacker can guess your laziest random passwords in the same amount of time it takes to watch a movie. It turns out when you put the most brutally fast consumer graphics card on the task of, uh, brute-forcing 8-character passwords, it can crack a numbers-only string in 3 hours. Such is the finding of Hive Systems, a cybersecurity firm based in Virginia, as part of the research that went into its 2025 password table. The chart shows how fast a "consumer budget" hacker could brute-force passwords of varying lengths (4 to 18 characters) and compositions (e.g., numbers only, lowercase letters, uppercase and lowercase letters, etc.).
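The arithmetic behind the numbers-only figure can be sketched as follows. This is a rough illustration, not Hive Systems' methodology: it assumes the quoted 3 hours is the time to try every candidate in the keyspace, and the function names are hypothetical. The hashing algorithm and the actual guess rate are not stated in the excerpt, so the rate below is only what the quoted figure implies.

```python
"""Back-of-the-envelope brute-force arithmetic for the numbers-only example.

Assumption (not stated in the article): the quoted "3 hours" is the time to
try every candidate in the keyspace, not the average time to a hit.
"""

SECONDS_PER_HOUR = 3600


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct fixed-length passwords over an alphabet."""
    return alphabet_size ** length


def implied_guess_rate(alphabet_size: int, length: int, hours: float) -> float:
    """Guesses per second needed to exhaust the keyspace in `hours`."""
    return keyspace(alphabet_size, length) / (hours * SECONDS_PER_HOUR)


def hours_to_exhaust(alphabet_size: int, length: int,
                     guesses_per_second: float) -> float:
    """Hours needed to try every candidate at a fixed guess rate."""
    return keyspace(alphabet_size, length) / guesses_per_second / SECONDS_PER_HOUR


if __name__ == "__main__":
    # 8 characters drawn from the 10 digits: 10**8 candidates.
    rate = implied_guess_rate(alphabet_size=10, length=8, hours=3)
    print(f"digits-only keyspace: {keyspace(10, 8):,}")
    print(f"implied rate: {rate:,.0f} guesses/second")
    # Same length, lowercase letters only, at the same implied rate.
    print(f"lowercase-only: {hours_to_exhaust(26, 8, rate):,.0f} hours")
```

Under that assumption, 8 digits give 100,000,000 candidates and an implied rate of roughly 9,260 guesses per second. The same function shows how quickly the time grows once the alphabet widens, which is the pattern the password table describes.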

artificial intelligence, natural language, password, (19 more...)

Country: North America > United States > Virginia (0.25)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.36)

Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering

Sato, Yugen, Takagi, Tomohiro

arXiv.org Artificial Intelligence, Mar-28-2025

Recent advances in large language models (LLMs) have led to the development of multimodal LLMs (MLLMs) in the fields of natural language processing (NLP) and computer vision. Although these models allow for integrated visual and language understanding, they present challenges such as opaque internal processing and the generation of hallucinations and misinformation. Therefore, there is a need for a method to clarify the location of knowledge in MLLMs. In this study, we propose a method to identify neurons associated with specific knowledge using MiniGPT-4, a Transformer-based MLLM. Specifically, we extract knowledge neurons through two stages: activation-difference filtering using inpainting and gradient-based filtering using GradCAM. Experiments on the image caption generation task using the MS COCO 2017 dataset, with quantitative evaluation by BLEU, ROUGE, and BERTScore and qualitative evaluation using an activation heatmap, showed that our method is able to locate knowledge with higher accuracy than existing methods. This study contributes to the visualization and explainability of knowledge in MLLMs and shows the potential for future knowledge editing and control.

large language model, machine learning, natural language, (20 more...)

2503.22941

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(3 more...)

Genre: Research Report > New Finding (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

Ji, Jiaming, Chen, Xinyu, Pan, Rui, Zhu, Han, Zhang, Conghui, Li, Jiahao, Hong, Donghai, Chen, Boyuan, Zhou, Jiayi, Wang, Kaile, Dai, Juntao, Chan, Chi-Min, Han, Sirui, Guo, Yike, Yang, Yaodong

arXiv.org Artificial Intelligence, Mar-22-2025

Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? In a further step, we need to explore how to fine-tune MLLMs to enhance reasoning performance while ensuring they satisfy safety constraints. Fundamentally, this can be formulated as a min-max optimization problem. In this study, we propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety using separate multimodal reward and cost models within a Lagrangian-based constrained optimization framework. Given the lack of preference datasets that separate helpfulness and safety in multimodal scenarios, we introduce BeaverTails-V, the first open-source dataset with dual preference annotations for helpfulness and safety, along with multi-level safety labels (minor, moderate, severe). Additionally, we design a Multi-level Guardrail System to proactively defend against unsafe queries and adversarial attacks. By applying the Beaver-Guard-V moderation for 5 rounds of filtering and re-generation on the precursor model, the overall safety of the upstream model is significantly improved by an average of 40.9%. Experimental results demonstrate that fine-tuning different MLLMs with Safe RLHF can effectively enhance model helpfulness while ensuring improved safety. Specifically, Safe RLHF-V improves model safety by 34.2% and helpfulness by 34.3%. All datasets, models, and code can be found at https://github.com/SafeRLHF-V to support the safe development of MLLMs and reduce potential societal risks.
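For readers unfamiliar with the phrasing, a Lagrangian-based constrained objective of the kind the abstract mentions is commonly written in the generic form below. This is a textbook sketch, not the paper's exact formulation: the symbols $\pi_\theta$ (policy), $R$ (reward model), $C$ (cost model), $d$ (cost threshold), and $\lambda$ (Lagrange multiplier) are assumptions introduced here for illustration.

$$
\max_{\theta} \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]
\quad \text{subject to} \quad
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[C(x, y)\big] \le d
$$

$$
\max_{\theta} \; \min_{\lambda \ge 0} \;
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]
- \lambda \Big( \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[C(x, y)\big] - d \Big)
$$

The second line is the min-max reading of the first: whenever the expected cost exceeds $d$, the inner minimization drives $\lambda$ upward and penalizes the policy, so helpfulness is maximized only among policies that satisfy the safety constraint.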

artificial intelligence, machine learning, natural language, (18 more...)

2503.17682

Country:

North America > United States > New York (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.86)

Industry:

Law Enforcement & Public Safety (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)